ABSTRACT

Dictionary learning algorithms have been successfully used in both reconstructive and discriminative tasks, where the input signal is represented by a linear combination of a few dictionary atoms. While these methods are usually developed under an $\ell_1$ sparsity constraint (prior) in the input domain, recent studies have demonstrated the advantages of sparse representation using structured sparsity priors in the kernel domain. In this paper, we propose a supervised dictionary learning algorithm in the kernel domain for hyperspectral image classification. In the proposed formulation, the dictionary and classifier are obtained jointly for optimal classification performance. The supervised formulation is task-driven and provides learned features from the hyperspectral data that are well suited for the classification task. Moreover, the proposed algorithm uses a joint ($\ell_{12}$) sparsity prior to enforce collaboration among the neighboring pixels. The simulation results illustrate the efficiency of the proposed dictionary learning algorithm.

Index Terms: Dictionary learning, kernel methods, hyperspectral image classification

1. INTRODUCTION 

Hyperspectral imagery (HSI) has become increasingly popular for remote sensing applications such as target detection [?] and material identification [?]. Among the several algorithms used for HSI classification [?], it has been shown that sparse representation classification (SRC) can achieve superior results [?]. For this purpose, a dictionary is usually constructed by collecting all the training samples, i.e., labeled pixels, and the underlying assumption is that the test pixel can be approximated with a few dictionary atoms, i.e., the test pixel lies in a low-dimensional subspace formed by the training samples that have the same label as the test pixel. However, the sparse coefficients generated by SRC can become unstable due to the high coherence of the dictionary atoms [?]. This situation can be alleviated by enforcing similarity in the sparse codes of the neighboring pixels, which usually have similar spectral features, through an appropriate structured sparsity prior [?]. In particular, the joint sparsity prior assumes that the neighboring pixels lie in the same low-dimensional subspace. It enforces collaboration among these pixels and yields more stable sparse coefficients, which results in improved classification performance [?].

Recently, it has been shown that learning the dictionary, rather than constructing it from all the training samples, can significantly improve the performance of sparse representation-based algorithms for both reconstructive [?] and discriminative tasks [?]. Dictionary learning algorithms can generally be categorized into two groups: unsupervised and supervised methods. Unsupervised dictionary learning aims at finding a dictionary that yields the minimum error for reconstruction tasks such as denoising [?], while supervised dictionary learning algorithms utilize the labels to minimize a misclassification cost [?]. It has recently been shown that a task-driven formulation can achieve state-of-the-art performance in several classification tasks by jointly learning the dictionary and classifier [15].

Similar to other machine learning methods, kernelized sparse representation algorithms, which map the input into a higher-dimensional feature space using a kernel function, can result in significant performance improvements compared to their linear counterparts [?]. The rationale is that when the data from different classes are projected into the kernel-induced feature space, the classes become more separable and samples from the same class typically cluster together in subspaces, resulting in more discriminative sparse codes. For this purpose, a few kernelized dictionary learning algorithms have been proposed [18, 19]. In [18], an unsupervised learning algorithm is proposed by kernelizing the well-known K-SVD [20] algorithm for object recognition. In [19], a supervised formulation has been proposed based on the Hilbert-Schmidt independence criterion to maximize the dependency between the data and the corresponding class labels. However, for a classification task, the preference is to utilize the labeled data to minimize a misclassification cost [15].

In this paper, a kernelized task-driven dictionary learning algorithm is proposed in which a dictionary is trained to be optimal for HSI classification. The proposed algorithm generalizes the task-driven formulation of [15] in two important ways. First, it enforces correlation among the neighboring pixels using the joint sparsity prior. Second, it generalizes the algorithm by providing a kernelized formulation. The proposed dictionary is obtained by solving a bi-level optimization problem, and it is shown that, while the underlying joint sparse coding problem is non-smooth, the bi-level optimization cost is differentiable. The simulation results demonstrate that the proposed algorithm achieves state-of-the-art performance for HSI classification tasks.

2. BACKGROUND 
2.1. Dictionary learning 

Dictionary learning has been widely used in various tasks such as reconstruction, classification, and compressive sensing [15, 21, 22]. Let $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{n \times N}$ be the collection of $N$ (normalized) training HSI pixels, where $n$ is the number of spectral bands. In an unsupervised formulation, the dictionary $D \in \mathbb{R}^{n \times d}$ is usually obtained as the minimizer of the following cost [23]

$$g(D) = \mathbb{E}_x[l_u(x, D)], \tag{1}$$

over the regularizing convex set $\mathcal{D} = \{D \in \mathbb{R}^{n \times d} \mid \|d_k\|_{\ell_2} \le 1, \forall k = 1, \ldots, d\}$, where $d_k$ is the $k$th column, or atom, of the dictionary and the unsupervised loss $l_u$ is defined as

$$l_u(x, D) = \min_{\alpha \in \mathbb{R}^d} \|x - D\alpha\|_{\ell_2}^2 + \lambda_1 \|\alpha\|_{\ell_1} + \lambda_2 \|\alpha\|_{\ell_2}^2, \tag{2}$$

which is the optimal value of the sparse coding problem with $\lambda_1$ and $\lambda_2$ being the regularization parameters. It is assumed that the data $x$ is drawn from a finite probability distribution $p(x)$, which is usually unknown. A stationary point of the optimization problem can be efficiently obtained by an online optimization algorithm [23].
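As a rough illustration of the sparse coding problem inside (2), the following Python sketch solves it for a fixed dictionary with a plain iterative soft-thresholding (ISTA) loop. The solver choice, the function names, and the toy data are assumptions made for illustration only; the text above does not prescribe a particular sparse coding solver.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(x, D, lam1, lam2, n_iter=500):
    """Approximately minimize ||x - D a||^2 + lam1 ||a||_1 + lam2 ||a||^2 by ISTA."""
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * (np.linalg.norm(D, 2) ** 2 + lam2)
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * (D.T @ (D @ alpha - x)) + 2.0 * lam2 * alpha
        alpha = soft_threshold(alpha - grad / L, lam1 / L)
    return alpha

def objective(x, D, alpha, lam1, lam2):
    """Value of the cost inside the minimization in (2)."""
    r = x - D @ alpha
    return r @ r + lam1 * np.abs(alpha).sum() + lam2 * (alpha @ alpha)

# Toy example (hypothetical data): a signal aligned with atom 2.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 8))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms, so D lies in the constraint set
x = 0.9 * D[:, 2]
alpha = sparse_code(x, D, lam1=0.1, lam2=0.0)
```

In this toy case the code concentrates on the atom that generated the signal, which is the behavior the sparsity prior is meant to encourage.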

The trained dictionary can then be used to (sparsely) reconstruct the inputs, and the reconstruction error is usually a robust measure for classification tasks [24, 25]. Another use of the trained dictionary is for feature learning, where the sparse code $\alpha^*(x, D)$, obtained as a solution of (2), is used as the input feature for training a classifier in the classical expected risk optimization framework. However, it has been shown that more discriminative features can generally be obtained by learning the dictionary and classifier jointly in the following task-driven formulation [15]

$$\min_{D \in \mathcal{D},\, W \in \mathcal{W}} \mathbb{E}_{y,x}[l_{su}(y, W, \alpha^*(x, D))] + \frac{\nu}{2}\|W\|_F^2, \tag{3}$$

where $y \in \mathbb{R}^C$ is a binary vector representing the ground truth label of the input $x$ for a $C$-class classification problem, $l_{su}$ is a (supervised) convex loss function that measures how well one can predict $y$ given the feature $\alpha^*$ and the model parameters $W \in \mathcal{W}$, and $\nu$ is the regularization parameter. In this paper, the quadratic loss is used, which is defined as

$$l_{su}(y, W, \alpha^*) = \frac{1}{2}\|y - W\alpha^*\|_{\ell_2}^2, \tag{4}$$

and $\mathcal{W} = \mathbb{R}^{C \times d}$.
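The quadratic loss (4) with a binary (one-hot) label vector can be sketched in a few lines of Python. The prediction rule used here, picking the class with the largest entry of $W\alpha^*$, is a common choice and is an assumption of this sketch rather than something stated above; all function names and numbers are hypothetical.

```python
import numpy as np

def one_hot(label, n_classes):
    """Binary label vector y with a single 1 at the class index."""
    y = np.zeros(n_classes)
    y[label] = 1.0
    return y

def quadratic_loss(y, W, alpha):
    """Supervised loss (4): 0.5 * ||y - W alpha||^2."""
    r = y - W @ alpha
    return 0.5 * (r @ r)

def predict(W, alpha):
    """Assumed decision rule: class with the largest linear response."""
    return int(np.argmax(W @ alpha))

# Toy example with C = 3 classes and d = 2 atoms.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
alpha = np.array([0.2, 0.9])
y = one_hot(1, 3)
loss = quadratic_loss(y, W, alpha)
label = predict(W, alpha)
```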


2.2. Kernelized sparse representation with structured 
sparsity prior 

Kernel methods are usually used to project the data into a higher-dimensional feature space in which different classes become linearly separable. Let $\Phi : \mathbb{R}^n \to \mathcal{F}$ be a mapping from $\mathbb{R}^n$ to a feature space $\mathcal{F}$, which can possibly be infinite-dimensional. It is assumed that $\mathcal{F}$ is a Hilbert space, which allows the use of Mercer kernels to carry out the projection implicitly. A Mercer kernel $k(x_1, x_2) : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a function defined as $k(x_1, x_2) = \langle \Phi(x_1), \Phi(x_2) \rangle$, where $\langle \cdot, \cdot \rangle$ is the inner product operator [26]. Commonly used kernel functions include the Gaussian kernel, an exponential of the negative squared distance $\|x_1 - x_2\|_{\ell_2}^2$ scaled by the width parameter $\sigma$, and the polynomial kernel $k(x_1, x_2) = \langle x_1, x_2 \rangle^c$, where $\sigma$ and $c$ are the kernel parameters.
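The two kernels can be evaluated on whole sets of pixels at once, as in the following Python sketch. The $1/(2\sigma^2)$ scaling of the Gaussian kernel is one common convention and is an assumption here, since other normalizations of the width parameter are also in use; the function names and data are hypothetical.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    """Gaussian kernel matrix between the columns of X1 and X2.

    Assumed convention: k(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)).
    """
    sq_norm1 = np.sum(X1 ** 2, axis=0)[:, None]
    sq_norm2 = np.sum(X2 ** 2, axis=0)[None, :]
    # Clip tiny negative values caused by floating-point round-off.
    sq_dist = np.maximum(sq_norm1 + sq_norm2 - 2.0 * (X1.T @ X2), 0.0)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial_kernel(X1, X2, c):
    """Polynomial kernel matrix k(x1, x2) = <x1, x2>^c between columns."""
    return (X1.T @ X2) ** c

# Toy data: 4 dictionary atoms and 3 pixels with 6 spectral bands each.
rng = np.random.default_rng(0)
D = rng.standard_normal((6, 4))
X = rng.standard_normal((6, 3))
K_DD = gaussian_kernel(D, D, sigma=2.0)   # d x d matrix of k(d_i, d_j)
K_DX = gaussian_kernel(D, X, sigma=2.0)   # d x S matrix of k(d_i, x^s)
```

Only these kernel matrices are needed by kernelized sparse coding; the feature map itself is never formed.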

The kernel sparse representation of the input feature $\Phi(x)$ can then be obtained by solving [?]

$$\min_{\alpha \in \mathbb{R}^d} \|\Phi(x) - \Phi(D)\alpha\|_{\ell_2}^2 + \lambda_1\|\alpha\|_{\ell_1} + \lambda_2\|\alpha\|_{\ell_2}^2,$$

where $\Phi(D) = [\Phi(d_1) \ldots \Phi(d_d)]$ and $d_j$ are the columns of $D$. Note that $\|\Phi(x) - \Phi(D)\alpha\|_{\ell_2}^2 = k(x, x) - 2\alpha^T k(D, x) + \alpha^T k(D, D)\alpha$, where $k(D, x) \in \mathbb{R}^d$ has entries $k(d_j, x)$ and $k(D, D) \in \mathbb{R}^{d \times d}$ has entries $k(d_i, d_j)$, so no explicit mapping into the feature space is required to solve the optimization problem. As discussed in the previous section, neighboring HSI pixels usually have similar spectral features and more robust sparse codes can be obtained if they are jointly reconstructed [?]. Let $\{x^1, \ldots, x^S\}$ be the set of $S$ neighboring pixels centered at $x^1$, denoted as $\{x^s\}$ in this paper. Joint sparsity enforces the neighboring pixels to be represented in the same subspace, and the optimal sparse coefficients $A^*(\{x^s\}, D)$ are obtained by solving the following optimization problem

$$\operatorname*{argmin}_{A = [\alpha^1 \ldots \alpha^S] \in \mathbb{R}^{d \times S}} \frac{1}{2}\sum_{s=1}^{S}\|\Phi(x^s) - \Phi(D)\alpha^s\|_{\ell_2}^2 + \lambda_1\|A\|_{\ell_{12}} + \frac{\lambda_2}{2}\|A\|_F^2, \tag{5}$$

where $\alpha^s$ is the sparse code for pixel $x^s$ and $\|A\|_{\ell_{12}} = \sum_{j=1}^{d}\|a_{j\to}\|_{\ell_2}$, in which the $a_{j\to}$ are the rows of $A$. The above optimization problem encourages row sparsity in $A^*$ and therefore the neighboring pixels are enforced to be jointly reconstructed by the same sparse code pattern [?].
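A minimal Python sketch of the joint sparse coding problem (5) is given below. It uses a proximal gradient loop with row-wise (group) soft-thresholding, which is one standard way to handle the $\ell_{12}$ penalty and is an assumption of this sketch, not necessarily the solver used in the experiments. The solver only touches the kernel matrices $k(D, D)$ and $k(D, x^s)$; a linear kernel and synthetic data are used in the toy example, and all names are hypothetical.

```python
import numpy as np

def row_soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_{l12}: shrink each row's l2 norm by tau."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A

def kernel_joint_sparse_code(K_DD, K_DX, lam1, lam2=0.0, n_iter=1000):
    """Approximately solve (5) given K_DD = k(D, D) (d x d) and K_DX = [k(D, x^s)] (d x S)."""
    d, S = K_DX.shape
    L = np.linalg.eigvalsh(K_DD)[-1] + lam2      # Lipschitz constant of the smooth part
    A = np.zeros((d, S))
    for _ in range(n_iter):
        grad = K_DD @ A - K_DX + lam2 * A        # gradient of the smooth part of (5)
        A = row_soft_threshold(A - grad / L, lam1 / L)
    return A

def active_rows(A, tol=1e-10):
    """Indices of the rows of A with non-zero l2 norm."""
    return np.flatnonzero(np.linalg.norm(A, axis=1) > tol)

# Toy example: S = 5 neighboring pixels that are noisy copies of atom 3.
rng = np.random.default_rng(0)
D = rng.standard_normal((30, 10))
D /= np.linalg.norm(D, axis=0)
X = D[:, [3]] + 0.01 * rng.standard_normal((30, 5))
K_DD, K_DX = D.T @ D, D.T @ X                    # linear kernel for simplicity
lam1 = 0.3
A = kernel_joint_sparse_code(K_DD, K_DX, lam1)
print(active_rows(A))
```

Because whole rows are shrunk together, all neighboring pixels share the same set of selected atoms, which is the row-sparsity pattern described above.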

3. KERNELIZED TASK-DRIVEN DICTIONARY 
LEARNING 

This section extends the task-driven dictionary learning algorithm by using the joint sparsity prior, which enforces collaboration among the neighboring HSI pixels. Moreover, we extend the algorithm to the kernel domain, which provides a general framework for task-driven dictionary learning using arbitrary kernel functions. With the same notation as in the previous section, and without loss of generality, let the input signal consist of $S$ neighboring pixels $\{x^s\}$ centered at $x^1$ and let the label vector of the center pixel be $y$. We propose to obtain the dictionary $D^*$ and the model parameter $W^*$ jointly in the kernel space as the minimizer of the following optimization problem

$$\min_{D \in \mathcal{D},\, W \in \mathcal{W}} \mathbb{E}[l_{su}(y, W, \alpha^{*1}(\{x^s\}, D))] + \frac{\nu}{2}\|W\|_F^2, \tag{6}$$

where $\alpha^{*1}$ is the first column of the minimizer $A^*(\{x^s\}, D)$ of the optimization problem (5), which is the sparse code for the center pixel, and $l_{su}$ is defined in Eq. (4). It should be noted that while $l_{su}$ is chosen to be the quadratic loss for simplicity, the formulation can be easily extended to any other convex cost function such as those used in [?]. The expectation is taken with respect to the joint probability distribution of the HSI inputs $\{x^s\}$ and label $y$.

The main difficulty in optimizing (6) is the non-differentiability of $A^*(\{x^s\}, D)$. However, it can be shown that the sparse coefficient matrix $A^*$ is differentiable almost everywhere. To prove that, one can use the optimality condition of $A^*$, which for each row $j$ reads

$$\begin{cases} [k(d_j, x^1) \ldots k(d_j, x^S)] - k(d_j, D)A^* - \lambda_2 a^*_{j\to} = \lambda_1 \dfrac{a^*_{j\to}}{\|a^*_{j\to}\|_{\ell_2}}, & \text{if } \|a^*_{j\to}\|_{\ell_2} \ne 0, \\ \left\|[k(d_j, x^1) \ldots k(d_j, x^S)] - k(d_j, D)A^* - \lambda_2 a^*_{j\to}\right\|_{\ell_2} \le \lambda_1, & \text{otherwise}, \end{cases} \tag{7}$$

which is obtained from the subgradient of the cost function. For the solution $A^*$, the active set is defined to be

$$\Lambda = \{j \in \{1, \ldots, d\} : \|a^*_{j\to}\|_{\ell_2} \ne 0\}, \tag{8}$$

where $a^*_{j\to}$ is the $j$th row of $A^*$. It can be shown that the active set is locally constant under small perturbations of $(\{x^s\}, D)$ and, therefore, $A^*$ is locally differentiable. Moreover, similar to the procedure in [15, 27], it can be shown that the set of points where the active set changes has measure zero and therefore $\mathbb{E}[l_{su}(y, W, \alpha^{*1})]$ is differentiable on $\mathcal{D} \times \mathcal{W}$, and the gradients can be computed using the chain rule. The detailed proof is somewhat involved and is omitted here due to space limitations. The algorithm to find the optimal dictionary $D^*$ and model parameter $W^*$ for HSI classification is described in Algorithm 1. In the special case when $S = 1$ and a linear kernel is chosen, the proposed algorithm reduces to the task-driven dictionary learning algorithm in [15]. In theory, one needs to select $\lambda_2$ in Eq. (5) to be strictly positive, which guarantees that the linear equation in the algorithm (step 7) has a unique solution. In other words, it is easy to show that the matrix $(k(D_\Lambda, D_\Lambda) \otimes I + \lambda_1 \Delta + \lambda_2 I)$ in Algorithm 1 is positive definite given $\lambda_1 > 0, \lambda_2 > 0$. However, in practice it is observed that setting $\lambda_2$ to zero yields satisfactory results. As in any nonconvex optimization problem, if the algorithm is not initialized properly, it may yield poor performance. In this paper, we used unsupervised dictionary learning with stochastic gradient descent to initialize $D$. Once the dictionary $D$ is initialized, the initial value of $W$ is set by solving (6) only with respect to $W$, which is a convex optimization problem.
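For the quadratic loss, the convex problem used to initialize $W$ has a closed-form ridge-regression solution once the expectation is replaced by an average over $N$ training samples. The following Python sketch assumes exactly that empirical-average setting; the function name, the synthetic codes, and the parameter values are hypothetical.

```python
import numpy as np

def init_classifier(A_codes, Y, nu):
    """Minimize (1/N) sum_i 0.5 * ||y_i - W a_i||^2 + (nu/2) ||W||_F^2 over W.

    A_codes: d x N matrix of center-pixel sparse codes for a fixed dictionary.
    Y:       C x N matrix of one-hot label vectors.
    """
    d, N = A_codes.shape
    G = A_codes @ A_codes.T / N + nu * np.eye(d)   # symmetric positive definite for nu > 0
    # Stationarity: W G = Y A^T / N, solved without forming an explicit inverse.
    return np.linalg.solve(G, A_codes @ Y.T / N).T

# Toy example: d = 5 atoms, C = 3 classes, N = 40 samples.
rng = np.random.default_rng(0)
A_codes = rng.standard_normal((5, 40))
labels = rng.integers(0, 3, size=40)
Y = np.eye(3)[:, labels]
nu = 1e-2
W0 = init_classifier(A_codes, Y, nu)
```

At the returned `W0` the gradient of the regularized empirical cost vanishes, which is the optimality condition of this convex subproblem.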


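Algorithm 1 Stochastic gradient descent algorithm for the kernelized task-driven dictionary learning under the joint sparsity prior

Input: Kernel function $k$, neighborhood size $S$, regularization parameters $\lambda_1, \lambda_2, \nu$, learning rate parameters $\rho, t_0$, number of iterations $T$, initial dictionary $D \in \mathcal{D}$, and initial model parameter $W \in \mathcal{W}$.

Output: Learned $D$ and $W$

1: for $t = 1, \ldots, T$ do

2: Draw a sample $(x^1, \ldots, x^S, y)$ where $x^1$ is a training pixel randomly selected from the training set with label $y$ and $(x^2, \ldots, x^S)$ are its closest $(S - 1)$ HSI pixels.

3: Find the solution $A^* = [\alpha^{*1} \ldots \alpha^{*S}] = [a^{*T}_{1\to} \ldots a^{*T}_{d\to}]^T \in \mathbb{R}^{d \times S}$ of the joint sparse coding problem (5).

4: Compute the set of active rows $\Lambda$ of $A^*$ using (8).

5: Let $D_\Lambda \in \mathbb{R}^{n \times |\Lambda|}$ and $W_\Lambda \in \mathbb{R}^{C \times |\Lambda|}$ be formed by the columns of $D$ and $W$ which are indexed in $\Lambda$.

6: Compute $\Delta = \bigoplus_{j \in \Lambda} \Delta_j \in \mathbb{R}^{S|\Lambda| \times S|\Lambda|}$, where $\Delta_j = \dfrac{1}{\|a^*_{j\to}\|_{\ell_2}} I - \dfrac{a^{*T}_{j\to} a^*_{j\to}}{\|a^*_{j\to}\|_{\ell_2}^3} \in \mathbb{R}^{S \times S}$, $\forall j \in \Lambda$, $I$ is the identity matrix, and $\oplus$ is the direct sum operator.

7: Compute $\beta \in \mathbb{R}^{dS}$ as:

   $\beta_{\Gamma^c} = 0, \quad \beta_\Gamma = (k(D_\Lambda, D_\Lambda) \otimes I + \lambda_1 \Delta + \lambda_2 I)^{-1} g,$

   where $\Gamma = \bigcup_{j \in \Lambda} \{j, j + d, \ldots, j + (S - 1)d\}$, $\beta_\Gamma$ is a vector in $\mathbb{R}^{|\Gamma|}$ whose rows are those of $\beta$ indexed by $\Gamma$, $\otimes$ is the Kronecker product, $g = \mathrm{vec}((W\tilde{A} - Y)^T W_\Lambda)$, $\tilde{A} = [\alpha^{*1}, 0, \ldots, 0] \in \mathbb{R}^{d \times S}$, $Y = [y, 0, \ldots, 0] \in \mathbb{R}^{C \times S}$, and $\mathrm{vec}(\cdot)$ is the vectorization operator.

8: Choose the learning rate $\rho_t \leftarrow \min(\rho, \rho\, t_0 / t)$.

9: Update the parameters by a projected gradient step:

   $W \leftarrow \Pi_{\mathcal{W}}\left[W - \rho_t\left((W\alpha^{*1} - y)\alpha^{*1T} + \nu W\right)\right],$

   $D \leftarrow \Pi_{\mathcal{D}}\Big[D - \rho_t \sum_{s=1}^{S} \Big\{ \big[k'(x^s, d_1) - k'(D, d_1)\alpha^{*s} \;\ldots\; k'(x^s, d_d) - k'(D, d_d)\alpha^{*s}\big]\, \mathrm{diag}(\beta_{\bar{s}}) - \big[k'(D, d_1)\beta_{\bar{s}}\alpha^{*s}_1 \;\ldots\; k'(D, d_d)\beta_{\bar{s}}\alpha^{*s}_d\big] \Big\}\Big],$

   where $\bar{s} = \{s, s + S, \ldots, s + (d - 1)S\}$ and $k'(D, d_k) = \left[\dfrac{\partial k(d_1, d_k)}{\partial d_k} \;\ldots\; \dfrac{\partial k(d_d, d_k)}{\partial d_k}\right] \in \mathbb{R}^{n \times d}$.

10: end for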
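The simpler parts of the update, namely the step-size choice, the classifier update, and the projection of the dictionary onto the set of atoms with at most unit norm, can be sketched in Python as follows. The step-size rule $\min(\rho, \rho\, t_0 / t)$ is the heuristic of the linear task-driven formulation [15] and is assumed here; the dictionary gradient, which depends on the kernel derivatives and on $\beta$, is deliberately left out. All function names and numbers are hypothetical.

```python
import numpy as np

def learning_rate(t, rho, t0):
    """Assumed heuristic: constant rate rho for the first t0 iterations, then 1/t decay."""
    return min(rho, rho * t0 / t)

def update_classifier(W, alpha1, y, nu, rho_t):
    """Gradient step on W for the quadratic loss with Frobenius regularization."""
    grad = np.outer(W @ alpha1 - y, alpha1) + nu * W
    return W - rho_t * grad

def project_dictionary(D):
    """Project each atom (column) of D onto the unit l2 ball."""
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

# Toy usage with C = 2 classes and d = 2 atoms.
W = np.zeros((2, 2))
alpha1 = np.array([1.0, 0.0])      # sparse code of the center pixel
y = np.array([1.0, 0.0])           # one-hot label
W_new = update_classifier(W, alpha1, y, nu=0.0, rho_t=0.5)

D = np.array([[3.0, 0.3],
              [4.0, 0.4]])
D_proj = project_dictionary(D)     # first atom is rescaled, second is left unchanged
```

Since the classifier is unconstrained, only the dictionary needs an actual projection after its gradient step.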


4. RESULTS AND DISCUSSION 

The performance of the proposed HSI classification algorithm is evaluated on the Indian Pine image, which was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS), and the University of Pavia image. The Indian Pine image contains 16 classes spread over 145 x 145 pixels, and each pixel has 220 bands ranging from 0.2 to 2.4 μm. The 20 bands corresponding to water absorption are removed before processing the image. Similar to the setup in [11], we randomly select 997 pixels (10.64% of the available data) to form the training set and the rest of the pixels are used for testing. The University of Pavia image is an urban image and has 115 spectral bands ranging from 0.43 to 0.86 μm. It contains 9 classes spread over 610 x 340 pixels. The 12 noisiest bands are removed. For this dataset, the standard training and test split is used [?], where the training set consists of 3,921 pixels (10.64% of the available data) and the remaining 40,002 pixels are used for testing. For the dictionary learning algorithms, the size of the dictionary is chosen to be 5 atoms per class. The regularization parameters $\lambda_1$ and $\nu$ and the Gaussian kernel parameter $\sigma$ are selected using cross-validation on the sets $\{0.001, 0.01, 0.1\}$, $\{10^{-8}, 10^{-7}, \ldots, 10^{-1}\}$, and $\{0.5, 1, \ldots, 5\}$, respectively, and $\lambda_2$ is set to zero. The learning parameters $\rho$ and $t_0$ are selected similar to the procedure outlined in [15].

Table 1. Average accuracy (AA) and overall accuracy (OA), in %, obtained for HSI classification of the Indian Pine image.

Method               Dictionary size   OA      AA
SVM-l                -                 64.94   56.53
SVM-k                -                 75.78   61.40
SRC-$\ell_1$-l       d = 997           71.88   64.28
SRC-$\ell_1$-k       d = 997           74.83   67.19
SRC-$\ell_{12}$-l    d = 997           76.41   64.67
SRC-$\ell_{12}$-k    d = 997           77.41   63.66
SDL-$\ell_1$-l       d = 80            81.43   66.43
SDL-$\ell_1$-k       d = 80            83.48   74.65
SDL-$\ell_{12}$-l    d = 80            84.14   76.56
SDL-$\ell_{12}$-k    d = 80            87.56   81.25

Table 2. Average accuracy (AA) and overall accuracy (OA), in %, obtained for HSI classification of the University of Pavia image.

Method               Dictionary size   OA      AA
SVM-l                -                 61.84   65.09
SVM-k                -                 62.43   72.14
SRC-$\ell_1$-l       d = 3921          66.51   75.98
SRC-$\ell_1$-k       d = 3921          74.05   80.06
SRC-$\ell_{12}$-l    d = 3921          83.86   86.29
SRC-$\ell_{12}$-k    d = 3921          82.67   85.28
SDL-$\ell_1$-l       d = 45            69.30   83.44
SDL-$\ell_1$-k       d = 45            81.25   82.24
SDL-$\ell_{12}$-l    d = 45            84.48   84.47
SDL-$\ell_{12}$-k    d = 45            86.07   87.37

The performance of the proposed kernelized dictionary learning algorithm is compared with the linear task-driven dictionary learning algorithm (SDL-$\ell_1$-l) proposed in [15]. For this purpose, we report the results of our proposed algorithm using three different settings, named SDL-$\ell_1$-k, SDL-$\ell_{12}$-l, and SDL-$\ell_{12}$-k. SDL-$\ell_1$-k is the extension of SDL-$\ell_1$-l to the kernel domain. SDL-$\ell_{12}$-l enforces collaboration among the neighboring pixels using the joint sparsity prior in the linear domain. Finally, SDL-$\ell_{12}$-k is the setting where the neighboring pixels are jointly reconstructed in the kernel domain. We also evaluate the performance of the proposed algorithm against linear and kernel SVM, namely SVM-l and SVM-k, respectively, as well as the sparse representation-based classification algorithms. For the latter, all the training samples are used to construct the dictionary and the results are reported using the $\ell_1$ and $\ell_{12}$ priors in both the linear and kernel domains, named SRC-$\ell_1$-l, SRC-$\ell_1$-k, SRC-$\ell_{12}$-l, and SRC-$\ell_{12}$-k, accordingly.

The classification results on the Indian Pine and University of Pavia hyperspectral images are shown in Table 1 and Table 2, respectively. As expected, the kernelized formulations usually achieve better classification performance. Moreover, it is consistently observed that using the joint sparsity prior ($\ell_{12}$ norm) to enforce collaboration among the neighboring pixels improves the performance. The proposed SDL-$\ell_{12}$-k achieves the best performance among the competing algorithms for both datasets. In comparing the performance of the dictionary learning-based algorithms with those in which the dictionary is constructed by collecting all the training samples, one should also note the difference in the dictionary sizes. The proposed task-driven formulations achieve better performance with more compact dictionaries, which translates into more computationally efficient processing of the test samples.
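For reference, the two reported measures can be computed as in the Python sketch below. It assumes the definitions that are customary in HSI classification: overall accuracy (OA) is the fraction of correctly labeled test pixels, and average accuracy (AA) is the mean of the per-class accuracies. The function name and the toy labels are hypothetical.

```python
import numpy as np

def overall_and_average_accuracy(y_true, y_pred, n_classes):
    """Return (OA, AA) as fractions in [0, 1]."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    oa = np.mean(y_true == y_pred)
    # Per-class accuracy, skipping classes that do not occur in the test set.
    per_class = [np.mean(y_pred[y_true == c] == c)
                 for c in range(n_classes) if np.any(y_true == c)]
    aa = np.mean(per_class)
    return oa, aa

# Toy example with an imbalanced two-class test set.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
oa, aa = overall_and_average_accuracy(y_true, y_pred, n_classes=2)
```

With imbalanced classes the two measures can differ noticeably, which is why both are reported in the tables.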

5. CONCLUSIONS 

In this paper, a kernelized task-driven dictionary learning algorithm is proposed for supervised HSI classification. The proposed formulation uses a joint sparsity prior, which enforces collaboration among the neighboring pixels for robust sparse representation. It is shown that the proposed algorithm, equipped with a compact dictionary, achieves state-of-the-art performance for classification of the Indian Pine and the University of Pavia hyperspectral images. The proposed formulation provides a general framework for nonlinear supervised dictionary learning that can be readily applied to other classification tasks. Future research topics include the extension of the proposed algorithm to other structured sparsity priors and testing them on different classification tasks.